Introduction

Los Angeles County provides openly available data on all restaurant and market inspections over the past 5 years. Facilities are subject to inspection 1 to 3 times a year, and made public within 1 week of inspection date. The frequency in which restaurants and food markets are inspected depends on the public health risk associated with the food products served or prepared and on the facility’s history of inspection grades. Inspectors deduct points based on violations and health risks, which is turned into a score out of 100. In addition, Los Angeles County data from 2018 on population health is publicly available. Demographic data, such as age and race distribution, socioeconomic data, such as proportion receiving EBT and proportion employed, and health outcomes data, such as proportion with asthma and rates of suicide, are provided for 87 cities within Los Angeles County.

Objective

I am interested in exploring restaurant inspection ratings in LA County. I have a few questions, with the main one being: Are restaurant inspection ratings associated with community health status? Secondary questions include: What are the “safest” and most “dangerous” cities in LA County for eating restaurant food? What restaurant chain is the “safest” to eat at? What restaurant chains should one proceed with caution?

Methods

Reading in and wrangling the data

I used 2 data sets which I merged together for this project. Both are available at data.lacounty.gov to download as CSV (I have also uploaded these datasets to my github repository). The first is a dataset of all LA County restaurant inspections. I added LA County city health data to see if there were any relationships between public health outcomes and local restaurant hygiene. These datasets were merged by city name, and no restaurant was missing its city in the first dataset. City names were briefly inspected to ensure matching would be feasible. As Los Angeles had many sub-cities for which there was health data, I only included the “City of Los Angeles” data to represent the local public health for any restaurant with city listed as Los Angeles. Only restaurants in cities with health data were included in this analysis.

if (!file.exists("LACinspections.csv.zip"))
  download(
    url = "https://raw.githubusercontent.com/v-yin/PM566-Midterm/LACinspections.csv.zip",
    dest = "LACinspections.csv.zip",
    mode="wb"
    )
unzip("LACinspections.csv.zip", exdir="./")
inspect <- read.csv("LACinspections.csv")

if (!file.exists("LAChealth.csv"))
  download(
    url = "https://raw.githubusercontent.com/v-yin/PM566-Midterm/main/LAChealth.csv",
    dest = "LAChealth.csv",
    mode="wb"
    )
health <- read.csv("LAChealth.csv")

health$GEONAME <- toupper(health$GEONAME)
health$GEONAME <- replace(health$GEONAME, health$GEONAME=="LOS ANGELES, CITY OF", "LOS ANGELES")

resthealth <- merge(
  x = inspect,
  y = health,
  by.x = "FACILITY_CITY",
  by.y = "GEONAME",
  all.x = FALSE,
  all.y = FALSE
)
resthealth <- data.table(resthealth)

Data Exploration

Data were explored utilizing R package ggplot2 to create a histogram of scores. Implausibly low scores were deleted (score less than 50, of which there was 1 value with score 3). Average scores within cities were computed and compared. Restaurant chains were identified through tokenizing words as bigrams, and looking to see most common chain restaurants. The following set of 9 chains were selected: McDonald’s, Jack in the Box, Starbucks, El Pollo Loco, Panda Express, Taco Bell, Del Taco, In N Out, Panera Bread. Average chain inspection scores were computed and compared. Measures of public health were selected: proportion with depression, proportion with obesity, proportion with diabetes. To explore a chain restaurant inspections scores with its surrounding city’s health, average score for a chain restaurant was calculated.

In the future, I hope to use OpenStreetMaps API to gather latitude and longitude data for each restaurant. For now, I was able to use geocode_zip to gather lattitude and longitude for most zip codes in the dataset. The zip codes’ average inspection score, proportion of diabetes, obesity, and depression were calculated. Heat maps of the averages were created to compare spatially where clean/dirty restaurants are in relation to proportion of the population with Diabetes, obesity, or depression.

# Delete scores less than 50
library(data.table)
resthealth <- resthealth[SCORE>50]
# Find average inspection score by city
city_avg <- resthealth[ , .(
  scoreavg = mean(SCORE) 
), by = "FACILITY_CITY"]
# Find restaurant chains
resthealth %>% unnest_ngrams(word, FACILITY_NAME, n=2) %>% anti_join(stop_words, by = c("word")) %>% count(word, sort=TRUE) %>% as_tibble() %>% print(n=100)
## # A tibble: 40,730 × 2
##     word                     n
##     <chr>                <int>
##   1 <NA>                 16432
##   2 ralphs market         2066
##   3 7 eleven              2009
##   4 meat market           1907
##   5 mexican grill         1488
##   6 in the                1444
##   7 the box               1441
##   8 el pollo              1416
##   9 jack in               1408
##  10 starbucks coffee      1343
##  11 pollo loco            1297
##  12 mini market           1250
##  13 panda express         1176
##  14 vons market           1151
##  15 mexican food          1124
##  16 taco bell              969
##  17 4 less                 948
##  18 food 4                 924
##  19 eleven store           855
##  20 los angeles            826
##  21 chipotle mexican       804
##  22 whole foods            800
##  23 pizza hut              793
##  24 el super               774
##  25 domino's pizza         760
##  26 cvs pharmacy           759
##  27 little caesars         737
##  28 carl's jr              729
##  29 foods market           728
##  30 fried chicken          717
##  31 burger king            707
##  32 fish market            702
##  33 ice cream              684
##  34 superior grocers       663
##  35 dodger stadium         656
##  36 waba grill             656
##  37 stater bros            651
##  38 smart final            616
##  39 ranch market           588
##  40 mexican restaurant     585
##  41 farmers market         554
##  42 wal mart               549
##  43 hawaiian bbq           543
##  44 afc sushi              519
##  45 bakery cafe            516
##  46 del taco               516
##  47 la michoacana          514
##  48 subway sandwiches      514
##  49 sprouts farmers        497
##  50 liquor market          494
##  51 tacos el               491
##  52 northgate market       483
##  53 costco wholesale       481
##  54 thai restaurant        478
##  55 bar grill              469
##  56 thai cuisine           469
##  57 hong kong              465
##  58 jamba juice            444
##  59 jersey mike's          444
##  60 house of               441
##  61 rite aid               440
##  62 chinese food           438
##  63 baskin robbins         435
##  64 the coffee             433
##  65 dollar tree            426
##  66 mike's subs            422
##  67 99 cents               421
##  68 numero uno             415
##  69 coffee bean            409
##  70 center levy            403
##  71 staples center         403
##  72 tea leaf               403
##  73 flame broiler          402
##  74 vallarta supermarket   401
##  75 bean tea               400
##  76 coffee shop            396
##  77 juice bar              381
##  78 beverly hills          375
##  79 mariscos el            368
##  80 the flame              365
##  81 japanese restaurant    360
##  82 in n                   356
##  83 n out                  356
##  84 korean bbq             354
##  85 fast food              351
##  86 out burger             350
##  87 drive in               347
##  88 and grill              341
##  89 burger grill           334
##  90 country club           329
##  91 mini mart              323
##  92 yum yum                318
##  93 seafood restaurant     311
##  94 am pm                  310
##  95 the habit              308
##  96 noodle house           307
##  97 santa monica           307
##  98 bristol farms          304
##  99 gelson's market        303
## 100 sam's club             302
## # … with 40,630 more rows
# Create chain variable
resthealth$FACILITY_NAME <- toupper(resthealth$FACILITY_NAME)
resthealth$CHAIN <- ifelse(grepl("MCDONALD", resthealth$FACILITY_NAME), "McDonald's", ifelse(grepl("JACK IN THE BOX", resthealth$FACILITY_NAME), "Jack in the Box", ifelse(grepl("STARBUCKS", resthealth$FACILITY_NAME), "Starbucks", ifelse(grepl("EL POLLO LOCO", resthealth$FACILITY_NAME), "El Pollo Loco", ifelse(grepl("PANDA EXPRESS", resthealth$FACILITY_NAME), "Panda Express", ifelse(grepl("TACO BELL", resthealth$FACILITY_NAME), "Taco Bell", ifelse(grepl("DEL TACO", resthealth$FACILITY_NAME), "Del Taco", ifelse(grepl("OUT BURGER", resthealth$FACILITY_NAME), "In N Out", ifelse(grepl("PANERA BREAD", resthealth$FACILITY_NAME), "Panera Bread", NA)))))))))
# Find average inspection score by city
chain_avg <- resthealth[ , .(
  scoreavg = mean(SCORE) 
), by = "CHAIN"]
# Clean health outcome data to be numeric
resthealth$Prop_obse <- as.numeric(as.character(resthealth$Prop_obse))
## Warning: NAs introduced by coercion
resthealth$Prop_DM <- as.numeric(as.character(resthealth$Prop_DM))
## Warning: NAs introduced by coercion
# Average score by restaurant
zip_avg <- resthealth[ , (ZipAvg = mean(SCORE, by = "FACILITY_ZIP")), list(FACILITY_NAME, FACILITY_ADDRESS, FACILITY_CITY, FACILITY_STATE, FACILITY_ZIP, Prop_DM, Prop_obse, Prop_depr) ]
zip_avg$FACILITY_ZIP <- as.character(zip_avg$FACILITY_ZIP)
# geocode to get lattitude and longitude by zip code
library(zipcodeR)
zip_cord <- geocode_zip(zip_avg$FACILITY_ZIP)
zip <- merge(
  x = zip_avg,
  y = zip_cord,
  by.x = "FACILITY_ZIP",
  by.y = "zipcode",
  all.x = TRUE,
  all.y= FALSE
)

# attempt to get restaurant lat/lng
# failed: geocode_osm , can't install SUNGEO
# failed: tidygeocoder,  takes hours
# myaddress.df <- resthealth[, query := paste0(FACILITY_ADDRESS, ", ", FACILITY_CITY, ", CA ", FACILITY_ZIP)]

# address <- geo(address = myaddress.df$query, method = "osm")

Preliminary Results

In total, there were 252773 inspections of 39423 restaurants in 66 cities within LA County.

Of the 252773 inspections included in the analysis, the average grade was 93.9351117 with a standard deviation of 3.8669866. The highest score was a perfect score, 100 whereas the lowest score was 54. Interestingly, there appears to be a peak at 90, which corresponds to the lowest possible score to achieve an A rating. This may hint at bias involved in the inspection grading process.

library(ggplot2)
scorehisto <- ggplot(resthealth, aes(x=SCORE)) +
  geom_histogram(binwidth=1, fill="red") +
  ggtitle("Distribution of LA County Restaurant Inspection Scores, 2017-2022") +
  xlab("Inspection Score")
scorehisto

Of the 66 cities included in this analysis, Long Beach had the highest average city inspection score. The city with the worst average score was Monterey Park.

city_avg %>% arrange(desc(scoreavg)) %>% slice(1:5) %>% knitr::kable(col.names = c("City", "Average Score"), caption = "Top 5 Cities")
Top 5 Cities
City Average Score
LONG BEACH 97.37500
CALABASAS 96.03195
SANTA CLARITA 95.74072
MAYWOOD 95.60426
EAST LOS ANGELES 95.49254
city_avg %>% arrange(scoreavg) %>% slice(1:5) %>% knitr::kable(col.names = c("City", "Average Score"), caption = "Bottom 5 Cities")
Bottom 5 Cities
City Average Score
MONTEREY PARK 91.63380
ROWLAND HEIGHTS 91.68030
ALHAMBRA 92.36034
GARDENA 92.36540
CERRITOS 92.59487

Of the chain restaurants examined, Panda Express had the best average score, whereas Del Taco had the lowest average score. Cities with Starbucks and Panera appeared to have the lowest proportion of diabetes. However, depression was highest in cities with Starbucks and Panda Express. Cities with Del Taco and Panera Bread had the highest amount of obesity.

chain_avg %>% na.omit() %>% arrange(desc(scoreavg)) %>% slice(1:10) %>% knitr::kable(col.names = c("Chain", "Average Score"), caption = "Chain Ranking")
Chain Ranking
Chain Average Score
Panda Express 97.01446
In N Out 96.89239
Starbucks 96.62458
Taco Bell 96.52941
McDonald’s 95.15971
El Pollo Loco 94.92152
Jack in the Box 94.79830
Panera Bread 94.62241
Del Taco 93.16085
resthealth[ , .(
  avgDM = mean(Prop_DM, na.rm = T),
  avgMDD = mean(Prop_depr, na.rm = T),
  avgOB = mean(Prop_obse, na.rm = T)
), by = "CHAIN"] %>% na.omit() %>% as_tibble() %>% knitr::kable(col.names= c("Chain", "Average Proportion Diabetes", "Average Proportion Depression", "Average Proportion Obesity"), caption = "Proportion of Diabetes, Depression, and Obesity in surrounding city by chain restaurant")
Proportion of Diabetes, Depression, and Obesity in surrounding city by chain restaurant
Chain Average Proportion Diabetes Average Proportion Depression Average Proportion Obesity
Starbucks 0.0984833 0.0875348 0.2299143
Taco Bell 0.1034938 0.0831847 0.2472439
McDonald’s 0.1036016 0.0826779 0.2438592
In N Out 0.1005722 0.0750168 0.2412087
Panda Express 0.1010105 0.0841394 0.2398609
Jack in the Box 0.1053523 0.0794227 0.2500741
Panera Bread 0.0976643 0.0812643 0.2606151
El Pollo Loco 0.1034037 0.0799796 0.2445605
Del Taco 0.1051764 0.0806202 0.2659273

Is there a correlation between restaurant inspection score and surrounding city health?

While most zip codes appear to have a decent average health inspection score (90+), there appear to be some areas where scores are lower. These include El Monte/Covina, Compton, and Central LA (West Adams). There even appears to be one zip code in Santa Monica with a poor average inspection score. In the following maps, I will explore spatial patterns in proportion of diabetes, obesity, and depression to see if it mirrors patterns found in the food inspection scores by zip code.

score.pal <- colorNumeric("inferno", domain=zip$V1)
scoremap <- leaflet(zip) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircles(
    lat = ~lat, lng = ~lng,
    label = ~paste0(V1), color = ~score.pal(V1),
    opacity = 1, fillOpacity = 1, radius = 200
  ) %>%
  addLegend("bottomright", pal=score.pal, values = zip$V1, title= "Inspection Score", opacity=1) %>% addTiles() %>%
  addControl("Average Inspection Scores by Zip Code", position = "bottomleft")
## Warning in validateCoords(lng, lat, funcName): Data contains 124 rows with
## either missing or invalid lat/lon values and will be ignored
scoremap

The distribution of the population proportion of diabetes in LA County has a clearer spatial pattern than those of restaurant inspection scores. It appears that diabetes is lowest in communities bordering the ocean and Glendale. Diabetes is most prevalent in South/Southwest/Southeast Los Angeles. These are also some of the communities where restaurant scores are lower.

DM.pal <- colorNumeric("viridis", domain=zip$Prop_DM)
DMmap <- leaflet(zip) %>%
    addProviderTiles("CartoDB.Positron") %>%
  addCircles(
    lat = ~lat, lng = ~lng,
    label = ~paste0(Prop_DM), color = ~DM.pal(Prop_DM),
    opacity = 1, fillOpacity = 1, radius = 200
  ) %>%
  addLegend("bottomright", pal=DM.pal, values = zip$Prop_DM, title= "Proportion in City with Diabetes", opacity=1) %>% addTiles()
## Warning in validateCoords(lng, lat, funcName): Data contains 124 rows with
## either missing or invalid lat/lon values and will be ignored
DMmap

Obesity proportion by city follows patterns similar to that of Diabetes. It appears that the city of Inglewood has one of the highest proportion of obesity, and it also does have poorer food inspection grades. South LA appears to be one of the areas with worst restaurant inspection scores and highest in comorbidities.

ob.pal <- colorNumeric("plasma", domain=zip$Prop_obse)
obmap <- leaflet(zip) %>%
    addProviderTiles("CartoDB.Positron") %>%
  addCircles(
    lat = ~lat, lng = ~lng,
    label = ~paste0(Prop_obse), color = ~ob.pal(Prop_obse),
    opacity = 1, fillOpacity = 1, radius = 200
  ) %>%
  addLegend("bottomright", pal=ob.pal, values = zip$Prop_obse, title= "Proportion in City with Obesity", opacity=1) %>% addTiles()
## Warning in validateCoords(lng, lat, funcName): Data contains 124 rows with
## either missing or invalid lat/lon values and will be ignored
obmap

Interestingly, depression appears to take a different spatial pattern than diabetes and obesity. Here, depression is clustered around Santa Monica, with one zip code highly prevalent in depression near Inglewood. It does not seem that restaurant hygeine correlates strongly with depression proportions.

dep.pal <- colorNumeric("BuPu", domain=zip$Prop_depr)
depmap <- leaflet(zip) %>%
    addProviderTiles("CartoDB.Positron") %>%
  addCircles(
    lat = ~lat, lng = ~lng,
    label = ~paste0(Prop_depr), color = ~dep.pal(Prop_depr),
    opacity = 1, fillOpacity = 1, radius = 200
  ) %>%
  addLegend("bottomright", pal=dep.pal, values = zip$Prop_depr, title= "Proportion in City with Depression", opacity=1) %>% addTiles()
## Warning in validateCoords(lng, lat, funcName): Data contains 124 rows with
## either missing or invalid lat/lon values and will be ignored
depmap

Conclusion

There appears to be some correlation between poor health and poor restaurant inspection scores. The communities where this was apparent were clustered in South LA (ie Compton, Inglewood), and Eastern Los Angeles. Inglewood appears to be one of the hardest hit cities with regards to poor inspection scores, high proportion of diabetes and obesity, and high proportion of depression. The safest chain restaurant appears to be Panda Express, whereas one should be cautious of Del Taco. San Gabriel Valley cities, like Monterey Park, Alhambra, and Rowland Heights, were found to have the lowest inspection scores in LA County.